ABSTRACT 


Intelligent Tutoring Systems (ITSs) have been developed to 
provide students with personalized learning experiences by 
adaptively generating learning paths optimized for each indi- 
vidual. Within the vast scope of ITS, score prediction stands 
out as an area of study that enables students to construct 
individually realistic goals based on their current position. 
Via the expected score provided by the ITS, a student can 
instantaneously compare their expected score to their actual 
score, which directly affects how reliable the student perceives 
the ITS to be. In other words, the precision of predicted scores 
is closely tied to the level of confidence that a student may 
have in an ITS, which in turn leads to improved student 
engagement. However, previous studies have solely concentrated 
on improving the performance of a prediction model, largely 
lacking focus on the benefits generated by its practical 
application. In this paper, we 
demonstrate that the accuracy of the score prediction model 
deployed in a real-world setting significantly impacts user en- 
gagement by providing empirical evidence. To that end, we 
apply a state-of-the-art deep attentive neural network-based 
score prediction model to Santa, a multi-platform English 
ITS with approximately 780K users in South Korea that 
exclusively focuses on the TOEIC (Test of English for Inter- 
national Communications) standardized examinations. We 
run a controlled A/B test on the ITS with two models, based 
respectively on collaborative filtering and deep attentive 
neural networks, to verify whether the more accurate model 
improves student engagement. The results show that the 
attentive model not only induces higher student morale (e.g. 
higher diagnostic test completion ratio, number of questions 
answered, etc.) but also encourages active engagement (e.g. 
higher purchase rate, improved total profit, etc.) on Santa. 
1. INTRODUCTION 


Standardized examinations (e.g. SAT and TOEIC) are significant 
because they provide an objective criterion by which each 
individual's academic performance is measured. Accordingly, 
Intelligent Tutoring Systems (ITSs), 
which generate optimized learning paths for each student, 
often include functions such as estimating expected perfor- 
mance on standardized examinations. In this regard, mea- 
suring the expected academic performance of a student has 
become an interesting area of study in Artificial Intelligence 
in Education (AIEd). These studies focus on modelling a 
student’s understanding of a target subject based on their 
learning activities. For instance, Matrix Factorization (MF) 
[10, 22, 16, 17, 23, 7, 25, 24] is a prevalent method used 
for grade prediction, in which the latent vectors of students 
and courses are learned by factorizing a student-grade ma- 
trix into two low-rank matrices. Markov and semi-Markov 
models are also some other popular approaches for grade 
prediction [11, 7, 23]. With the advances in deep learn- 
ing, neural network based models with deeper hidden lay- 
ers, such as Multi-Layer Perceptron, Recurrent Neural Net- 
works and Convolutional Neural Networks, were introduced 
to predict student’s academic performance [21, 9, 11, 8]. In 
[3], the Transformer-based [29] bidirectional encoder model 
was first pre-trained to predict masked assessments and then 
fine-tuned to predict exam score, resulting in a state-of-the-art 
score prediction model. Although the precision of academic 
performance prediction is significant, as it is directly tied to 
the reliability of an ITS, previous studies have mainly focused 
on improving prediction accuracy, leaving the benefits of precise 
prediction for student engagement largely unexamined. 


In this paper, we direct our attention towards the correlation 
between the precision of score prediction and student 
engagement. Our study starts by hypothesizing that students 
will show a higher level of engagement if they experience a 
more precise score prediction while interacting with an ITS. We 
empirically verify our hypothesis on Santa, a multi-platform 
English ITS with approximately 780K users in South Ko- 
rea that exclusively focuses on the TOEIC (Test of English 
for International Communications) Listening and Reading 
Test Preparation. In the experimental studies, we run a 
controlled A/B test with two score prediction models that 
differ in accuracy: one based on collaborative filtering, with a 
Mean Absolute Error (MAE) of 78.9, and one based on deep 
attentive neural networks, with an MAE of 49.8. The 
results show that the superior performing, deep attentive 
neural network based score prediction model induces more 
student engagement. These benefits range from ones that 
are derived from learning behavior (e.g. preliminary test 
completion ratio, membership rates, the average number of 
questions a student answered after the diagnostic test) to 
more active engagement (e.g. purchase rate, average revenue 
per user, and total profit). To the best of our knowledge, 
this is the first work to study the benefits of accurate 
score prediction in an ITS on student engagement. 


2. RELATED WORKS 


The related works of this study can be grouped into two 
categories: academic performance prediction and student 
engagement. 


2.1 Academic Performance Prediction 
Predicting a student’s academic performance is a signifi- 
cant aspect in solving the problems within AIEd. A suc- 
cessful prediction model can be used to recommend appro- 
priate courses, provide interventions for at-risk students, 
and optimally allocate learning materials. Extensive work 
has been conducted on performance prediction, exploring a 
wide range of methodologies from simple regressions to deep 
learning. 


The most widely used methodology in grade prediction is low 
rank Matrix Factorization (MF) [10, 22, 16, 17, 23, 7, 25, 
24]. Low rank MF assumes that there is a low-dimensional 
latent space containing features that can effectively repre- 
sent both students and the academic tasks students will be 
graded on. These features can be interpreted as represen- 
tations of a student’s knowledge. We find these features 
by decomposing a student-grade matrix into a product of 
two low-rank matrices. The authors of [22] show that the 
MF-based model outperforms other course/student-specific 
regression models. [16] improved the model by assuming 
that different courses share a common latent feature space, 
since the totality of a student’s knowledge should not change 
based on the courses they are taking. 
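
As a rough illustration of the low-rank idea described above, the following sketch factorizes a small student-grade matrix by gradient descent on the observed entries only. It is not the method of any cited work: the toy grades, the rank, the learning rate, the regularization weight, and the function names are all assumptions made for this example.

```python
import numpy as np


def factorize_grades(grades, observed, rank=2, lr=0.01, reg=0.01,
                     epochs=2000, seed=0):
    """Fit grades ~= students @ courses.T using only observed entries.

    `observed` is a 0/1 matrix marking which grades are known.
    """
    rng = np.random.default_rng(seed)
    n_students, n_courses = grades.shape
    students = 0.1 * rng.standard_normal((n_students, rank))
    courses = 0.1 * rng.standard_normal((n_courses, rank))
    for _ in range(epochs):
        # Residual is zeroed on unobserved entries so they do not drive updates.
        residual = observed * (students @ courses.T - grades)
        grad_students = residual @ courses + reg * students
        grad_courses = residual.T @ students + reg * courses
        students -= lr * grad_students
        courses -= lr * grad_courses
    return students, courses


def masked_rmse(grades, observed, prediction):
    """Root mean squared error over observed entries only."""
    diff = observed * (prediction - grades)
    return float(np.sqrt((diff ** 2).sum() / observed.sum()))


# Hypothetical toy data: 4 students x 3 courses, 0 marks a missing grade.
grades = np.array([[4.0, 3.0, 3.5],
                   [3.0, 2.0, 2.5],
                   [2.0, 1.5, 0.0],
                   [3.5, 0.0, 3.0]])
observed = np.array([[1, 1, 1],
                     [1, 1, 1],
                     [1, 1, 0],
                     [1, 0, 1]], dtype=float)

student_vecs, course_vecs = factorize_grades(grades, observed)
predicted = student_vecs @ course_vecs.T  # includes the two missing grades
```

The missing grades are read off `predicted` at the positions where `observed` is zero.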


Markov and semi-Markov models are another popular set 
of models for grade prediction [11, 7, 23]. These models 
capture the dynamic evolution of a student’s learning status 
and leverage it to effectively predict outcomes. [7] develops 
course-specific hidden Markov and semi-Markov models for 
grade prediction. [11] models student behavior in MOOCs 
by using Hidden Markov Models (HMMs) and Multinomial 
Mixture Models (MMMs) to cluster sequences of student 
actions. The study applies an LSTM model to predict the 
students’ final grades. Markov models are also used to esti- 
mate a student’s performance on educational games [28] or 
to predict student retention in MOOCs [1]. 


[21, 9, 11, 8] introduce deep-learning based prediction mod- 
els. The authors of [9] introduce two types of Bayesian deep 
learning models for grade prediction using Multi-Layer Per- 
ceptron and LSTM architectures. Their results show that 
their model outperforms several baseline models (including 
MF-based models and course-specific regression models) in 
detecting at-risk students. The authors of [3] propose 
Assessment Modeling (AM), a pre-training method applicable 
to general ITSs. In AM, a model is first pre-trained to 
predict several assessments of a student automatically made by 
the ITS during the learning process. Their results show that a 
Transformer [29] based neural network model with AM im- 
proves model accuracy compared to the same network with 
other state-of-the-art pre-training methods (such as BERT 
[5] based word embedding and QuesNet [31] question em- 
bedding) on exam score prediction and review correctness 
prediction. 


2.2 Student Engagement 

Student engagement is also an actively studied topic in the 
field of AIEd. Several works have analyzed student engagement 
patterns to determine which factors most strongly impact 
engagement. [30] studied how people use digital textbooks 
and compared engagement patterns among high school students, 
college students, and online website viewers. [18] 
investigated student engagement in an online learning system 
which outperformed a traditional classroom on key indica- 
tors of engagement, such as time on-task, engaged concen- 
tration, and boredom. [26] found correlations between se- 
mantic features of mathematics problems and indicators of 
engagement. [14] distinguished between behavioral engagement 
and cognitive engagement, and argued that most students who 
were behaviorally engaged were not cognitively engaged. 


Another line of student engagement research focused on pre- 
dicting engagement level. [20] proposed a two-phased ap- 
proach for automatic engagement detection, which utilized 
contextual logs and appearance information to infer behav- 
ioral engagement. [19] investigated the relationship between 
engagement and performance. First, this work analyzed 
log traces for each learner to calculate engagement indicators 
that represent the learner's engagement level. Based on the 
quantified engagement indicators, it then attempted to predict 
the learner's performance. 


Enrollment is a sign of strong engagement since it involves 
determination that a student must invest to take a certain 
course. Accordingly, predicting and promoting enrollment 
is highly relevant to student engagement research. [6] pro- 
posed a novel extension of Factorization Machines to in- 
fer students’ course enrollment information from incomplete 
data. [2] presented a course enrollment recommender system 
which recommended selective and optional courses based 
on students’ skills, knowledge and interests. [27] identified 
factors that affect the likelihood of enrolling. This work 
analyzed the enrollment predictability of such factors us- 
ing logistic regression, support vector machines, and semi- 
supervised probability methods. 


With the development of Massive Open Online Courses (MOOCs), 
several works studied student engagement in a MOOC 
environment. [12] proposed a recommender which provides 
each student with an individual list of contacts based on 
their own profile and activities to foster their engagement 
in MOOCs. [15] investigated the relationship between stu- 
dents’ self-evaluation of their previous knowledge and stu- 
dents’ engagement behaviors in MOOCs through a polyto- 
mous item response theory model. 


3. SCORE PREDICTION MODELS 
Figure 1: Pre-training/fine-tuning scheme of Assessment Modeling for score prediction. First, a model is 
pre-trained to predict two assessments: response correctness and timeliness. After pre-training, the last layer 
of the model is replaced with a layer with randomly initialized weights and appropriate dimension for score 
prediction. The parameters in the model are fine-tuned to predict exam scores. 


Our studies are based on comparing the two approaches for 
score prediction: a collaborative filtering based approach 
and Assessment Modeling. The following subsections briefly 
cover each approach. More detailed descriptions can be 
found in [13] and [3]. 


3.1 Collaborative Filtering based Approach 
There are two phases in the Collaborative Filtering (CF)-based 
score prediction approach. First, the CF-based model 
developed in [13] estimates the probability that a student 
responds correctly to each potential question. In this model, 
each user or question is represented as a k-dimensional 
latent vector, where k is the number of hidden concepts. For 
instance, if there are n users with m questions, we have user 
vectors $L_1, L_2, \ldots, L_n$ and question vectors 
$R_1, R_2, \ldots, R_m$, each with dimension k. The knowledge 
level of user i understanding question j is represented as 
$X_{ij} = L_i \cdot R_j$. Accordingly, the probability of user i 
getting question j correct is modeled as 

$$\phi(X_{ij}) = \phi_a + \frac{1 - \phi_a}{1 + e^{-\phi_c (X_{ij} - \phi_b)}},$$

where $\phi_a$, $\phi_b$, and $\phi_c$ are parameters appropriately 
set, independently of questions or users. The learning algorithm 
then finds the maximum likelihood estimator by minimizing 
the negative log-likelihood of the observed user-question 
entries, with Frobenius norm regularizer terms, through 
projected stochastic gradient descent. 
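
The first phase can be sketched in a few lines, assuming the correctness probability is a logistic curve in the knowledge level that is lifted by a floor parameter. The parameter values, the regularization weight, and the function names below are hypothetical, and the sketch only evaluates the regularized objective; the projected stochastic gradient descent used to minimize it is not shown.

```python
import numpy as np


def correct_prob(L, R, phi_a=0.25, phi_b=0.0, phi_c=1.0):
    """Probability that each user answers each question correctly.

    L: (n_users, k) user vectors, R: (n_questions, k) question vectors.
    phi_a acts as a floor, phi_b as a shift, phi_c as a slope
    (all values here are illustrative, not taken from the paper).
    """
    X = L @ R.T  # X[i, j] is the dot product of user i and question j
    return phi_a + (1.0 - phi_a) / (1.0 + np.exp(-phi_c * (X - phi_b)))


def neg_log_likelihood(L, R, observed, reg=0.01,
                       phi_a=0.25, phi_b=0.0, phi_c=1.0):
    """Regularized negative log-likelihood over observed responses.

    observed: iterable of (user_index, question_index, is_correct).
    """
    P = correct_prob(L, R, phi_a, phi_b, phi_c)
    nll = 0.0
    for i, j, y in observed:
        p = P[i, j]
        nll -= y * np.log(p) + (1 - y) * np.log(1.0 - p)
    # Squared Frobenius norms of both factor matrices as the regularizer.
    return nll + reg * (np.sum(L ** 2) + np.sum(R ** 2))
```

With all-zero latent vectors and the illustrative defaults, every probability evaluates to 0.25 + 0.75 / 2 = 0.625, which is a convenient sanity check.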


Given the response correctness probabilities calculated from 
the CF-based model, scores for Listening Comprehension 
(LC) and Reading Comprehension (RC) are calculated through 
the following quadratic equations 

$$\mathrm{score}_{LC} = \theta_2 x_{LC}^2 + \theta_1 x_{LC} + \theta_0,$$

$$\mathrm{score}_{RC} = \theta_5 x_{RC}^2 + \theta_4 x_{RC} + \theta_3,$$

where $x_{LC}$ and $x_{RC}$ are each the average of the predicted 
response correctness probabilities of potential questions in LC 
and RC, and the $\theta$s are properly set parameters. The final 
score is the sum of $\mathrm{score}_{LC}$ and $\mathrm{score}_{RC}$. 
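
The second phase amounts to averaging the predicted probabilities per section and passing each average through a quadratic. The sketch below assumes that form; the coefficient values are hypothetical placeholders chosen only so that the toy output is easy to check, and are not the parameters used in Santa.

```python
def section_score(mean_prob, quad, lin, const):
    """Quadratic mapping from mean correctness probability to a section score."""
    return quad * mean_prob ** 2 + lin * mean_prob + const


def predict_total_score(lc_probs, rc_probs, lc_coeffs, rc_coeffs):
    """Sum of the LC and RC section scores.

    lc_probs / rc_probs: predicted correctness probabilities of the
    potential questions in each section.
    lc_coeffs / rc_coeffs: (quad, lin, const) tuples for each section.
    """
    x_lc = sum(lc_probs) / len(lc_probs)
    x_rc = sum(rc_probs) / len(rc_probs)
    return section_score(x_lc, *lc_coeffs) + section_score(x_rc, *rc_coeffs)


# Hypothetical coefficients, for illustration only.
HYPOTHETICAL_COEFFS = (200.0, 290.0, 5.0)

total = predict_total_score([0.9, 0.7, 0.8], [0.6, 0.5, 0.7],
                            HYPOTHETICAL_COEFFS, HYPOTHETICAL_COEFFS)
```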


3.2 Assessment Modeling 

[3] introduced Assessment Modeling (AM), a fundamental 
pre-training method for a general class of ITSs. The 
motivation behind AM is to deal with label-scarce 
problems in AIEd. Score prediction is a typical example 
of such label-scarce educational problems, since standardized 
exam scores are not obtainable within an ITS: collecting 
the exam scores requires student action taken outside the ITS. 
The approach proposed in [3] is based on a pre-training/fine- 
tuning paradigm. In the pre-training phase, the Transformer- 
based [29] bidirectional encoder model is trained to predict 
randomly masked assessments, which are interactive educa- 
tional features available in ITS. Examples of these assess- 
ments include response correctness (whether a student pro- 
vides a correct response to a given question) and timeliness 
(whether a student responds to each question within the 
time limit specified by domain experts). In the fine-tuning 
phase, the last layer of the pre-trained model is replaced with 
a randomly initialized layer with an appropriate dimension 
for a specific downstream task. Afterwards, the parameters 
in the model are updated to predict labels in the downstream 
task. In the experimental studies conducted on EdNet [4], 
AM outperformed pre-training methods that learn the contents 
of learning materials in several downstream tasks, including 
score prediction. See Figure 1 for a graphical description 
of AM. 
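
The masking step of the pre-training phase can be illustrated with a small sketch. It covers only the construction of masked inputs and prediction targets for the two assessments named above; the Transformer encoder itself is omitted. The mask token value, the masking probability, and the function name are assumptions, not details taken from [3].

```python
import numpy as np

MASK = -1  # hypothetical placeholder id for a masked assessment


def mask_assessments(correctness, timeliness, mask_prob=0.15, seed=0):
    """Randomly hide assessments at some positions of an interaction sequence.

    correctness, timeliness: sequences of 0/1 values, one per interaction.
    Returns the masked inputs and a boolean array marking the positions
    whose original values the model would be trained to predict.
    """
    rng = np.random.default_rng(seed)
    correctness = np.asarray(correctness)
    timeliness = np.asarray(timeliness)
    is_masked = rng.random(correctness.shape) < mask_prob
    masked_correctness = np.where(is_masked, MASK, correctness)
    masked_timeliness = np.where(is_masked, MASK, timeliness)
    return masked_correctness, masked_timeliness, is_masked


# Toy interaction sequence of five questions.
inputs_c, inputs_t, targets_at = mask_assessments([1, 0, 1, 1, 0],
                                                  [1, 1, 0, 1, 1],
                                                  mask_prob=0.4)
```

During fine-tuning no masking is applied; the full sequence is fed to the model and the new output layer predicts the exam score.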


4. EXPERIMENTS 


4.1 Santa service 
Santa is a multi-platform English ITS with approximately 
780K users in South Korea that exclusively focuses on the 
TOEIC (Test of English for International Communications) 
standardized examinations. TOEIC is an English proficiency 
test that consists of two timed sections (listening and 
reading), each with 100 questions, that add up to a combined 
total score between 0 and 990. Santa utilizes several AI 
techniques to optimize the preparation process of the TOEIC 
examination for students. When the application is first 
launched, a preliminary placement test with 7 to 11 problems 
is given to diagnose the student's current state and predict 
their expected score in real time. After the diagnostic test, a 
user response prediction model is used to dynamically suggest 
problems that correspond to the student's current 
position within the TOEIC ladder. Problems are selected by 
computing the user's overall correctness rate, eliminating 
problems that the student would answer correctly with high 
probability, and then selecting the best possible content 
based on expert heuristics. Based on the user's previous data, 
the predicted scores can be provided in various forms 
throughout the service, as shown in Figure 3. Figure 2 shows 
the flow of score prediction. 

Figure 2: The flow of score prediction. 

Figure 3: Various representations of predicted 
scores within the application. 


4.2 Performance of Score Prediction Model 
Santa previously used a CF-based model for score prediction, 
which has recently been replaced with a deep attentive model. 
To train the model, we aggregate the real TOEIC scores 
reported by users of Santa. Santa offered a reward to users 
who reported their scores and obtained a total of 2,594 score 
reports over 6 months. The data is then divided into a 
training set (1,302 users, 1,815 labels), a validation set 
(244 users, 260 labels), and a test set (466 users, 519 labels). 
We use EdNet as the pre-training data and fine-tune on the 
student sequence data labeled with the reported TOEIC scores. 
Table 1 shows the MAE (Mean Absolute Error) of the two models 
on the test set. 


          CF       Deep Attentive model 
MAE       78.91    49.84 


Table 1: MAE of collaborative filtering and attentive 
model 
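
For reference, MAE is the average absolute gap between reported and predicted scores. A minimal sketch of the computation follows; the sample scores are made up for illustration and are not taken from the Santa test set.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired score lists."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical reported TOEIC scores and model predictions.
reported = [700, 850, 600]
predicted = [650, 900, 640]
mae = mean_absolute_error(reported, predicted)  # (50 + 50 + 40) / 3
```

On this scale, the gap in Table 1 means the deep attentive model's predictions are on average about 29 points closer to the reported score than those of the CF-based model.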


4.3 A/B test setup 

From February 24th to April 2nd, we conducted an A/B 
test by randomly assigning one of two score prediction 
algorithms to each application user: one based on 
collaborative filtering and the other based on deep learning. 
50,451 students were allocated to the collaborative filtering 
algorithm and 17,019 students to the deep-learning algorithm. 
We analyzed each student's responses and actions (such as time 
of registration, question response time, purchase rate, etc.) 
to identify statistics relevant to our hypothesis. 


4.4 Experimental Results 
In this section, we discuss how high-quality score predictions 
can significantly impact student morale. 


4.4.1 Student Motivation 

Our first test statistic is the preliminary test completion 
ratio. The completion rate of the initial placement test is a 
crucial indicator of a student's motivation, as only students 
who are willing to learn will try to finish their diagnostic 
test. For each question a student answers in Santa, a predicted 
score that is updated in real time is displayed in the top-left 
corner. This allows the user to immediately check the quality 
of the expected score, thus strengthening the trust that the 
user may have in the application. A/B test results show that 
the deep attentive model has a higher completion rate of 65.90% 
than the CF-based model with 64.93%. 

Figure 4: Comparison of the number of questions solved per day between the users of the A/B test. 


Next, we look at changes in membership rates. The membership 
rate of an application arguably signals stronger student 
motivation than the completion rate, as it directly indicates 
the determination of a user who wishes to use the application. 
Out of a total of 67,470 users that used Santa during the A/B 
test period, 44,297 users finished their diagnostic tests and 
28,065 users registered with the application. The A/B test 
shows that the deep attentive model has a registration rate of 
44.55% while the CF-based model has 43.13%. 


The average number of questions a user answered after the 
diagnostic test is also a significant indicator of a student's 
educational drive. The A/B test results show that with the deep 
attentive model a student solved an average of 22.73 questions, 
while with the CF-based model a student solved only 20.03. 
Figure 4 shows the comparison of the number of questions 
answered per day between the users of the A/B test. The x-axis 
represents the date and the y-axis represents the gap between 
the average number of questions answered under the deep 
attentive model and under the CF-based model. If the gap is 
positive, users of the former model solved more questions on 
average, and vice versa. We can observe that users of the deep 
attentive model solved more questions on most days of the A/B 
test period. 


                          CF       Deep Attentive model 
Completion rate (%)       64.93    65.90 
Registration rate (%)     43.13    44.55 
# of solved questions     20.03    22.73 


Table 2: Experimental results of student motivation 


4.4.2 Active Student Engagement 

In this section, we examine active student engagement under 
the different score prediction models by looking at the 
financial benefits the models bring. Monetary profit is an 
essential factor in evaluating a service and an important 
indicator of user engagement, since a high level of user 
engagement directly contributes to financial success. We 
measure business impact with three metrics: purchase rate, 
Average Revenue Per User (ARPU), and total profit. In this 
context, purchase rate is defined as the proportion of users 
that decided to purchase full access to the app during the A/B 
test period. The test results show that the purchase rate 
for the deep attentive model was 2.73% while the CF-based 
model had a 2.37% rate, a 15.19% relative increase for the 
deep attentive model. For ARPU, the deep attentive model 
averaged $3.23 whilst the CF-based model averaged $2.83. Total 
profit during the testing period was $162,933.88 for the 
former and $142,949.55 for the latter (since the two models 
had different parameters, these values were normalized based 
on the ratio of the model parameters). Comparing these three 
metrics, we conclude that the more accurate model, the deep 
attentive model, yields better business results as well. 


                      CF            Deep Attentive model 
Purchase rate (%)     2.37          2.73 
ARPU ($)              2.83          3.23 
Total profit ($)      142,949.55    162,933.88 


Table 3: Experimental results of student engagement 
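
The relative increases implied by these figures can be checked with simple arithmetic. The helper below is only an illustration of that calculation, using the purchase rates and total profits reported above; the function name is an assumption.

```python
def relative_increase(baseline, treatment):
    """Relative change of `treatment` over `baseline`, as a fraction."""
    return (treatment - baseline) / baseline


# CF-based model is the baseline, deep attentive model the treatment.
purchase_lift = relative_increase(2.37, 2.73)            # about 0.1519
profit_lift = relative_increase(142949.55, 162933.88)    # about 0.1398
```

The purchase-rate figure reproduces the 15.19% increase stated in the text, and the same calculation puts the normalized total profit of the deep attentive model roughly 14% above that of the CF-based model.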


5. CONCLUSIONS 


Recent developments in ITSs have enabled customized education 
by suggesting optimal study strategies for individual students. 
Santa has assisted its users in preparing for the TOEIC 
standardized English examinations by utilizing various learning 
techniques. Recently, Santa shifted from a collaborative 
filtering model to a deep attentive model, which proved to be 
more accurate than the former. To investigate the benefits of 
using a more accurate model, this paper conducted an A/B test 
and analyzed its results. The results lead us to believe that 
the deep attentive model brings a higher level of student 
motivation and engagement. Therefore, we claim that a more 
accurate model, in this case the deep attentive model, could 
induce improved student engagement. 